-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change the order of local tags #26
base: master
Are you sure you want to change the base?
Conversation
tests/test_metadata_utils.py
Outdated
PROCESSORS["html"] = HtmlProcessor | ||
|
||
text1, mask1 = add_local_metadata_to_text(self.examples[3], cfg) | ||
target_text = '<a>useless text </a><div><a><div><div></div></div></a></div><h1><i>The Walking Dead</i> (season 8)</h1>\n' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before the change I propose in this PR the result was here:
<a>useless text </div></a></div></a></div><h1><i><div><a><div><div>The Walking Dead</i> (season 8)</h1>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Just curious, is this (same index) a limitation of the source dataset? |
No no, I compile the dataset myself from Natural Questions and put it in the format proposed here. |
Ah, I see. I misunderstood. Thank you for the clarification! |
🙂
On the parser side I think it's really feasible, what's more blocking in my opinion is where to store the information in the |
I see. Funny you should mention that, because my first comment was gonna be "having a new property or even structure of jsonl for it" but then I didn't send it for exactly the same reason. lol That being said, I wonder if entities will also need that kind of nested construct. |
Hi @SaulLu, as discussed yesterday, I fully agree that this is an important issue that needs to be fixed. However, I'm not sure I fully understand how your proposed solution works - would you perhaps have 15 minutes or so sometime tomorrow (ideally, between 15:00 and 17:00 CEST) to go through this pull request with me? :) |
Summary of an offline discussion: unfortunately this PR does not solve all the concerns about the order of the local HTML metadata to be added. For example, @timoschick has rightly shown that Another pathological case is:
I'm converting this PR into a draft, for now, the time to think about another workaround. |
I propose in this PR to change the way in which local tags are added to the text.
Motivation
HTML defines a tree. In this sense, it is important that the rules for opening and closing tags are respected. For example, the following html is not valid
"<h1> test </h1></a><a>"
. With the current way of adding local tags, this rule may not be respected because of self closing tags which can have their opening and closing tags on the same character index.As a side note, with the current metadata format, one level of information is lost for HTM tags. Indeed, when two tags have the same index for their opening and closing tag, we don't know which one is the parent.
Implementation details
Review
I would be very happy to have your opinion on this PR